Multi-Scale Spatial Attention-Based Multi-Channel 2D Convolutional Network for Soil Property Prediction

Visible near-infrared spectroscopy (VNIR) is extensively researched for obtaining soil property information due to its rapid, cost-effective, and environmentally friendly advantages. Despite its widespread application and significant achievements in soil property analysis, current soil prediction models continue to suffer from low accuracy. To address this issue, we propose a convolutional neural network model that can achieve high-precision soil property prediction by creating 2D multi-channel inputs and applying a multi-scale spatial attention mechanism. Initially, we explored two-dimensional multi-channel inputs for seven soil properties in the public LUCAS spectral dataset using the Gramian Angular Field (GAF) method and various preprocessing techniques. Subsequently, we developed a convolutional neural network model with a multi-scale spatial attention mechanism to improve the network’s extraction of relevant spatial contextual information. Our proposed model showed superior performance in a statistical comparison with current state-of-the-art techniques. The RMSE (R²) values for various soil properties were as follows: organic carbon content (OC) of 19.083 (0.955), calcium carbonate content (CaCO3) of 24.901 (0.961), nitrogen content (N) of 0.969 (0.933), cation exchange capacity (CEC) of 6.52 (0.803), pH in H2O of 0.366 (0.927), clay content of 4.845 (0.86), and sand content of 12.069 (0.789). Our proposed model can effectively extract features from visible near-infrared spectroscopy data, contributing to the precise detection of soil properties.


Introduction
Soil is a critical natural resource, and the accurate and timely acquisition of soil property information is essential for ensuring soil health and achieving sustainable agriculture [1]. Traditional methods typically entail on-site sampling and laboratory testing; however, these approaches are costly, inefficient, and environmentally unfriendly. In recent years, soil visible-near-infrared reflectance spectroscopy has emerged as a rapid, cost-effective, environmentally friendly, non-destructive, and reproducible analytical technique [2], and it has gradually become an effective alternative to traditional methods. However, soil property prediction is challenging because spectral data contain numerous bands with strong collinearity and intricate interrelationships. With the advancement of machine learning, numerous nonlinear regression algorithms have been developed and applied. Said et al. [3] conducted a comparative analysis of three regression techniques, namely Partial Least Squares Regression (PLSR), Support Vector Machine (SVM), and Multivariate Adaptive Regression Splines (MARS), for the prediction of the organic matter and clay content in saline soils. Similarly, Yang et al. [4] employed four methods (PLSR, Least Squares Support Vector Machine (LS-SVM), Extreme Learning Machine (ELM), and the Cubist regression model) to forecast the soil organic matter and pH levels. Notwithstanding these advancements, these machine learning methods remain limited in computational efficiency and modeling capability.
In contrast to conventional machine learning methods, deep learning models, particularly convolutional neural networks (CNNs), are highly effective on multi-dimensional data and large-scale problems owing to their hierarchical structure and their ability to learn the patterns of complex problems [5]. They have been extensively utilized across domains such as image classification [6,7], natural language processing [8], and speech recognition [9]. By leveraging sparse local connections and weight sharing, CNNs have been proven to effectively and automatically learn and extract local and abstract features from complex spectral data [10]. By stacking multiple convolutional and pooling layers, CNNs can efficiently capture intricate patterns within the data, making them well-suited for soil property prediction tasks [11]. In recent years, the application of deep learning in soil spectroscopy has become increasingly widespread. In 2015, Veres et al. [12] pioneered the integration of deep learning into soil spectroscopy, successfully validating the efficacy of one-dimensional convolutional neural networks (1D CNNs) in predicting specific soil properties. To extract deep feature information, Zhong et al. [13] proposed deep CNN models for the regression prediction of seven soil properties. Spectral data are commonly considered to exhibit a temporal structure [14]. Identical feature peaks at different positions in spectral data may carry different information, and the sequential nature of spectral data can affect the accuracy of soil property predictions [15]. However, CNNs are insensitive to positional information during feature extraction, which can lead to a decline in model performance. To address this issue, some studies have adopted recurrent neural networks (RNNs), which are better suited for handling sequential data. RNNs can use feedback connections to store historical information over time. Singh et al. [16] used long short-term memory (LSTM) to predict six soil physical and chemical properties from the LUCAS spectral library. The network can effectively capture and retain short-term and long-term dependencies in sequential data. Yang et al. [17] proposed a novel approach, the Combined CNN and RNN model (CCNVR), that exploits the strengths of both CNNs and RNNs. Initially, the model employs a CNN to extract features from the raw soil spectra. Subsequently, it utilizes an RNN to analyze the relationships among these features. This integration effectively distills soil spectral features while also investigating the interconnections among them.
Furthermore, certain studies use two-dimensional transformations to convert one-dimensional spectral data into two-dimensional spectral images to enhance the feature extraction capabilities of the model. Padarian et al. [18] employed a short-time fast Fourier transformation to convert the raw spectra from the LUCAS database into two-dimensional spectrograms. Then, they used a 2D multi-task CNN to predict six soil properties. Li et al. [19] similarly used a short-time fast Fourier transformation to construct a dual-stream convolutional neural network model (Multi-CNN), which integrates both one-dimensional and two-dimensional convolutions to achieve accurate prediction of multiple soil properties. Jin et al. [20] investigated four methods for converting one-dimensional spectra into two-dimensional spectral images: slicing and reshaping, the Gramian angular difference field, the Gramian angular field, and the Markov transition field. They combined the transformed images with the Swin Transformer to predict six soil properties. Additionally, they demonstrated that the spatial positional correlations preserved in the Gramian angular field method could enhance the information extraction capability of deep neural networks.
This paper introduces a multi-scale spatial attention mechanism module to tackle the issues previously outlined. The spatial attention mechanism, a pivotal element within convolutional neural networks, functions as an adaptive process that selectively focuses on key spatial areas, thus addressing the question of "where to focus" [21]. This approach significantly improves the network's capacity to discern essential objects within the feature maps by identifying and emphasizing critical regions. It accomplishes this through the application of weighted operations across different areas of the input feature map along the spatial dimension, allowing the network to give precedence to pertinent information [22,23]. We aim to enhance the prediction of soil properties by employing a multi-scale spatial attention mechanism. This mechanism captures information at different scales using convolutional kernels of varying sizes, thereby improving the feature extraction capabilities of convolutional neural networks.
Furthermore, researchers utilize various algorithms to preprocess spectral data to advance the creation of more robust calibration models for soil property prediction. This preprocessing aims to diminish or eradicate noise in the spectra while highlighting relevant information. Ultimately, this assists calibration models in recognizing the correlation between the input spectra and output soil properties [24]. Common soil spectral preprocessing methods include Savitzky-Golay smoothing, standardization, and normalization techniques. Zhao et al. [25] employed four preprocessing methods (first-order derivative, standard normal variate transformation, multiple scatter correction, and detrending) to process the original spectra. Tsakiridis et al. [26] utilized absorbance spectra and some preprocessed spectra developed using standard techniques as one-dimensional multi-channel inputs for their model. It has been confirmed that effectively combining different preprocessing techniques in one-dimensional multi-input methods produces more robust prediction results than single-input methods. However, research on two-dimensional multi-channel inputs in soil visible-near-infrared spectroscopy prediction studies is scarce. We aim to explore whether two-dimensional multi-channel input methods can improve the prediction accuracy, thus providing more reliable tools for soil property analysis.

The Soil Dataset
The soil spectral dataset utilized in this study is derived from the LUCAS soil spectral dataset. This dataset, collected during the 2009 survey, includes 19,036 topsoil samples from 23 European Union countries. All samples underwent standardization and chemical analysis to determine their primary topsoil characteristics, such as coarse fragments, particle size distribution (clay, silt, and sand), pH, organic carbon, carbonates, soluble phosphorus, total nitrogen, extractable potassium, and cation exchange capacity. Spectral data were captured using a diffuse reflectance spectrometer (XDS™ Rapid Content Analyzer, NIRSystems, Inc., Laurel, MD, USA) across a range of 400–2500 nm with a spectral resolution of 0.5 nm, resulting in 4200 data points per sample [27–29]. In this study, seven soil properties were selected as target prediction variables: the calcium carbonate content (CaCO3, g·kg⁻¹), cation exchange capacity (CEC, cmol(+)·kg⁻¹), clay fraction (Clay, %), sand fraction (Sand, %), nitrogen content (N, g·kg⁻¹), organic carbon content (OC, g·kg⁻¹), and pH in H2O (pH). We considered all available soil samples in the dataset, encompassing both mineral and organic soils, without considering any additional information such as geographic origin or soil category.

Method
The entire experimental process was divided into four steps. First, the raw data underwent various preprocessing techniques. Second, the one-dimensional data were transformed into two-dimensional spectral images using the Gramian Angular Difference Field transformation. Third, the best combination of preprocessing methods for each soil property was identified for the multi-channel input using the Vgg16 network model [30]. Finally, the proposed deep learning model was employed to achieve high-precision predictions of soil properties.

Preprocessing Methods
Spectral preprocessing techniques optimize raw spectral data and provide more accurate inputs for subsequent analysis and modeling; different preprocessing methods also extract spectral information that complements each other. To fully leverage this complementary information, we selected spectra processed with a series of common preprocessing methods, along with the original absorbance spectra, as multi-channel inputs for the model, with each spectrum forming an independent channel. Several preprocessing methods commonly used in soil science (such as SG filtering, standard normal variate transformation, and scatter correction) were chosen to create a spectral information pool. The following seven methods were selected to transform the original absorbance spectra: (1) standard normal variate transformation followed by detrending (SNV + DT); (2) the zero-order Savitzky-Golay filter with a window width of 9, followed by standard normal variate transformation (SG0-SNV); (3) the first-order Savitzky-Golay filter with a window width of 9, followed by standard normal variate transformation (SG1-SNV); (4) the second-order Savitzky-Golay filter with a window width of 9, followed by standard normal variate transformation (SG2-SNV); (5) the zero-order Savitzky-Golay filter with a window width of 9, followed by multiple scatter correction (SG0-MSC); (6) the first-order Savitzky-Golay filter with a window width of 9, followed by multiple scatter correction (SG1-MSC); and (7) the second-order Savitzky-Golay filter with a window width of 9, followed by multiple scatter correction (SG2-MSC). The original spectra and the corresponding spectral transformations are depicted in Figure 1.
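The preprocessing chains listed above can be sketched with NumPy alone. This is a minimal illustration, not the authors' implementation: the function names `snv`, `msc`, and `savgol` are hypothetical, the Savitzky-Golay polynomial order (`polyorder=2`) and the edge padding are assumptions because the text specifies only the window width of 9 and the derivative order, and MSC is taken against the mean spectrum.

```python
from math import factorial

import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (assumed: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        # Fit row ~ slope * ref + intercept, then undo the scatter terms.
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


def savgol(spectra, window=9, polyorder=2, deriv=0):
    """Savitzky-Golay filter; deriv=0, 1, 2 corresponds to SG0, SG1, SG2.

    polyorder=2 and edge padding are assumptions, not values from the text.
    """
    half = window // 2
    offsets = np.arange(-half, half + 1)
    vander = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse yields the least-squares polynomial
    # coefficient a_deriv; the derivative at the window centre is deriv! * a_deriv.
    coeffs = factorial(deriv) * np.linalg.pinv(vander)[deriv]
    padded = np.pad(spectra, ((0, 0), (half, half)), mode="edge")
    return np.array([np.convolve(row, coeffs[::-1], mode="valid") for row in padded])


def sg_then_snv(spectra, deriv):
    """Example chain in the order given in the text: SG filter first, then SNV."""
    return snv(savgol(spectra, window=9, deriv=deriv))
```

Each chain returns an array with the same shape as its input, so every preprocessed spectrum can later be converted to its own image channel.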
Sensors 2024, 24, x FOR PEER REVIEW 4 of 18

2D Transformation Methods
In time series processing, the Gramian Angular Field (GAF) method [31] transforms time series data into image data. This technique retains the complete information of the signal while preserving its temporal dependencies. Visible near-infrared spectroscopy can be viewed as a type of time series. Utilizing the GAF transformation to preserve the spatial position correlations of spectral sequences enables data augmentation and improves the information extraction ability of neural networks [20]. After converting sequence data into image data, we can fully utilize the advantages of convolutional neural networks in image classification and recognition and explore the methods suitable for deep learning algorithm models. We can obtain a two-dimensional GAF image for a given sequence $X = \{x_t,\ t = 1, 2, \ldots, M\}$ by following the steps listed below:
To reduce the dimensionality of the sequence, this study employs the Piecewise Aggregate Approximation (PAA) method [32]. Using this method, we obtain the aggregated sequence $\bar{X} = \{\bar{x}_t,\ t = 1, 2, \ldots, N\}$. It should be noted that in this study, the value of N is set to 64. The formula for the sequence $\bar{X}$ is as follows:
$$\bar{x}_t = \frac{1}{k} \sum_{j = k(t-1)+1}^{kt} x_j \tag{1}$$
where $k = M/N$, $N < M$;
Next, the data obtained from the first step, $\bar{X}$, need to be processed using min-max normalization to scale its range to [0, 1]. This will result in a new data set $\tilde{X}$. The specific transformation method is shown in Equation (2).
$$\tilde{x}_i = \frac{\bar{x}_i - \min(\bar{X})}{\max(\bar{X}) - \min(\bar{X})} \tag{2}$$
For the data obtained in the second step, $\tilde{X}$, a polar coordinate transformation can be applied to obtain the corresponding angle and radius for each data point.
$$\phi_i = \arccos(\tilde{x}_i), \quad 0 \le \tilde{x}_i \le 1; \qquad r_i = \frac{i}{N} \tag{3}$$
where $\phi_i$ is the angle and $r_i$ is the radius;
Using Equations (4) and (5), the cosine of the sum of the angles and the sine of the difference between the angles for two different points can be calculated. Consequently, the Gramian Angular Summation Field ($X_{GASF}$) and Gramian Angular Difference Field ($X_{GADF}$) can be obtained.
$$X_{GASF} = \left[\cos(\phi_i + \phi_j)\right]_{i,j=1}^{N} \tag{4}$$
$$X_{GADF} = \left[\sin(\phi_i - \phi_j)\right]_{i,j=1}^{N} \tag{5}$$
In this study, we applied the GADF transformation, as shown in Figure 2.
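The PAA, min-max normalization, polar encoding, and GADF steps described in this section can be sketched as follows. This is an illustrative sketch only: the function names `paa` and `gadf` are hypothetical, and because 4200 points do not divide evenly into N = 64 segments, the sketch assumes nearly equal segments via `np.array_split`, which may differ from how the authors handled the remainder.

```python
import numpy as np


def paa(x, n_segments):
    """Piecewise aggregate approximation: mean of consecutive segments.

    Segments are nearly equal in length (an assumption for the case where
    the sequence length is not a multiple of n_segments).
    """
    x = np.asarray(x, dtype=float)
    return np.array([segment.mean() for segment in np.array_split(x, n_segments)])


def gadf(x, n_segments=64):
    """Gramian angular difference field of a 1D spectrum as an N x N image."""
    aggregated = paa(x, n_segments)
    # Min-max normalization to [0, 1]; clip guards against rounding error.
    scaled = (aggregated - aggregated.min()) / (aggregated.max() - aggregated.min())
    phi = np.arccos(np.clip(scaled, 0.0, 1.0))
    # Entry (i, j) is sin(phi_i - phi_j).
    return np.sin(phi[:, None] - phi[None, :])
```

Stacking the images of several preprocessed versions of the same spectrum along a new leading axis, for example with `np.stack`, would give the kind of multi-channel 2D input discussed in the text.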

Construction of Multi-Channel Input
To validate the effectiveness of the GADF method, we generated single-channel 2D images from the original soil spectral data. The original spectral sequences and the 2D images were used to train 1D_Vgg16 and 2D_Vgg16 models. Table 1 presents the 2D_Vgg16 network framework in detail. The following hyperparameters were used: SGD was the optimizer, the learning rate was 0.001, the mean squared error was the loss function (MSELoss), the training batch size was 64 samples, and there were 100 training epochs. With the network structure and hyperparameters fixed, only the input data could affect the prediction results. Next, we applied the preprocessing methods mentioned in Section 2.2.1 to the original spectral sequences, obtaining a series of spectral information. Subsequently, we transformed the spectral information into 2D images. We combined these image data in various ways to construct input data with different channel numbers, which were then fed into the 2D_Vgg16 model for training.
To investigate the relationship between the soil property prediction performance and the number of channels in the preprocessing method combination, we gradually increased the number of considered channels to observe the variations in the prediction performance of different properties. Firstly, considering only one channel, we selected one of the spectra described earlier (the original spectrum or one of its seven preprocessed versions) and obtained a one-channel spectral image by using a two-dimensional transformation as the input variable, denoted as NCC₁. Next, considering two channels, we selected any two of these spectra and obtained a two-channel spectral image by using a two-dimensional transformation as the input variable, denoted as NCC₂, and so on for other channels. Because channels are chosen without regard to order from eight candidate spectra, the number of NCC₁ and NCC₂ combinations was 8 and 28, respectively (Table 2). Finally, we compared the prediction accuracy of each property under different channel inputs. We selected the preprocessing method combination with the highest prediction accuracy for each property as the input for that property's multi-channel, two-dimensional image.
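The combination counts quoted above (8 for one channel, 28 for two) follow from choosing channels out of eight candidate spectra: the original absorbance spectrum plus the seven preprocessed versions. A short sketch of that enumeration is given below; the label `"RAW"` for the original spectrum and the helper name `channel_combinations` are assumptions introduced only for illustration.

```python
from itertools import combinations

# Eight candidate spectra: the original absorbance spectrum ("RAW", a
# hypothetical label) plus the seven preprocessing chains from the text.
CHANNELS = [
    "RAW", "SNV+DT", "SG0-SNV", "SG1-SNV",
    "SG2-SNV", "SG0-MSC", "SG1-MSC", "SG2-MSC",
]


def channel_combinations(n_channels):
    """All unordered selections of n_channels spectra for one multi-channel input."""
    return list(combinations(CHANNELS, n_channels))


# Number of candidate inputs for each channel count, 1 through 8.
counts = {k: len(channel_combinations(k)) for k in range(1, len(CHANNELS) + 1)}
```

Here `counts[1]` is 8 and `counts[2]` is 28, matching the text, and the counts over all channel numbers sum to 255 candidate inputs per soil property.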

Structure of the CNN Network
As illustrated in Figure 3, this paper introduces a two-dimensional convolutional neural network model with a spatial attention mechanism called CNNSANet. The model employs a hierarchical architecture divided into four stages, akin to certain studies in computer vision [33–35]. Each stage comprises a downsampling layer followed by a sequential stack of blocks. Each block contains a multi-scale spatial selection mechanism module and a multi-scale channel information fusion module. Downsampling is performed using layer normalization and a 2 × 2 convolution layer with a stride of 2.
To enhance the network's focus on the most relevant spatial contextual information, we introduce a Multi-Scale Spatial Selection Mechanism (MSSM), as illustrated in Figure 4. This module can select feature maps from convolutional kernels at different scales. First, to extract rich contextual information features from the input $X$, we utilize a series of depthwise separable convolutions with varying receptive fields.
$$U_0 = X, \qquad U_i = F_i^{dw}(U_{i-1}) \tag{6}$$
Here, $F_i^{dw}(\cdot)$ represents a depthwise separable convolution with a kernel size of $k_i$. Assuming there are $N$ convolutional kernels, the output of each kernel is further refined by a 1 × 1 convolution $F_i^{1 \times 1}(\cdot)$, as shown in Equation (7).
$$\tilde{U}_i = F_i^{1 \times 1}(U_i), \quad i = 1, 2, \ldots, N \tag{7}$$
To obtain more detailed and comprehensive feature information, the features obtained from different convolutional kernels with varying receptive field sizes are concatenated. This approach fully leverages the multi-level information extraction capabilities of different convolutional kernels on the image, thereby further enhancing the model's representative capacity and performance.
$$\tilde{U} = [\tilde{U}_1; \ldots; \tilde{U}_N]$$
Next, we employ the channel-wise average pooling method (represented as $P_{avg}(\cdot)$) to process the spatial features, resulting in the spatial feature map $SA$.
$$SA = P_{avg}(\tilde{U})$$
Then, through convolutional processing, we transform the pooled features (with only one channel) into $N$ spatial attention maps, denoted as $\widehat{SA}$.
$$\widehat{SA} = F^{1 \to N}(SA)$$
To acquire individual spatial selection masks for each convolutional kernel, we apply the Sigmoid activation function to each spatial attention map $\widehat{SA}_i$:
$$\widetilde{SA}_i = \sigma(\widehat{SA}_i)$$
Here, $\sigma(\cdot)$ denotes the Sigmoid function. Following this, the corresponding spatial selection mask is employed to apply weights to the features extracted by the various convolutional kernels. The weighted features are then combined using a convolutional layer $F(\cdot)$, thereby producing the attention feature $S$:
$$S = F\left(\sum_{i=1}^{N} \widetilde{SA}_i \cdot \tilde{U}_i\right)$$
Finally, the input feature $X$ is multiplied elementwise with $S$, yielding the final output $Y$.
$$Y = X \cdot S$$
Furthermore, we propose a Multi-Scale Channel Information Fusion (MCIF) module to enhance the model's representative ability and performance, as depicted in Figure 5. This module improves the network's ability to learn complex features and enhances information fusion between channels. The MCIF module consists of the following components: a parallel depthwise convolution module with four different scales, a 1 × 1 convolution for channel compression and expansion to reduce the computational cost, and a residual connection. In the parallel depthwise convolution module with four different scales, each convolution processes one-fourth of the channels. The depthwise convolution kernels with sizes {3, 5, 7} effectively capture multi-scale information. The 1 × 1 depthwise convolution kernel also acts as a learnable channel-wise scaling factor, further enhancing the module's performance. This design ensures that features at different scales are fully utilized, improving the model's ability to recognize and learn complex features. Finally, the residual connection better preserves and transmits the information about the original features. The following equation can represent the MCIF module:
Figure 5. Multi-scale channel information fusion model.
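To make the data flow of the multi-scale spatial selection mechanism described earlier in this section concrete, the following PyTorch sketch strings the steps together: a series of depthwise convolutions, 1 × 1 refinement, channel-wise average pooling, a convolution producing N attention maps, Sigmoid masks, weighted fusion, and elementwise multiplication with the input. It is a sketch under stated assumptions rather than the authors' code: the class name `MSSMSketch`, the kernel sizes `(3, 5, 7)`, the 7 × 7 kernel of the mask convolution, and the 1 × 1 fusion layer are all illustrative choices that the text does not specify.

```python
import torch
import torch.nn as nn


class MSSMSketch(nn.Module):
    """Illustrative multi-scale spatial selection block (hyperparameters assumed)."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Depthwise convolutions applied in series, one per scale.
        self.depthwise = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])
        # A 1 x 1 convolution refines the output of each scale.
        self.refine = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=1) for _ in kernel_sizes
        ])
        # Maps the single pooled channel to one attention map per scale.
        self.to_maps = nn.Conv2d(1, len(kernel_sizes), kernel_size=7, padding=3)
        # Fuses the mask-weighted features into the attention feature S.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        features, u = [], x
        for depthwise, refine in zip(self.depthwise, self.refine):
            u = depthwise(u)               # growing receptive field
            features.append(refine(u))     # refined feature of this scale
        # Channel-wise average pooling over the concatenated features.
        pooled = torch.cat(features, dim=1).mean(dim=1, keepdim=True)
        masks = torch.sigmoid(self.to_maps(pooled))   # shape (B, N, H, W)
        weighted = sum(f * masks[:, i:i + 1] for i, f in enumerate(features))
        s = self.fuse(weighted)
        return x * s                        # elementwise product with the input
```

Because every convolution uses stride 1 and padding of half the kernel size, the block preserves the spatial size and channel count of its input, so it can be stacked inside each stage without further reshaping.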

Evaluation
The Root Mean Square Error (RMSE), Coefficient of Determination (R²), and Ratio of Performance to Inter-Quartile Distance (RPIQ) are used to assess the trained model's performance. These metrics are computed on the test set, facilitating an objective and thorough evaluation of the model's performance. In the formulas below, $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the observed values, and $n$ is the number of samples. RMSE is used to quantify the discrepancy between the predicted values and the actual observations, and it is calculated as follows:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
R² is a statistical indicator used to evaluate the fit of a regression model. It represents the proportion of the variance in the actual data that the model explains. The R² values usually range between 0 and 1 (negative values occur only when a model fits worse than predicting the mean), with higher values signifying the greater explanatory capability of the model. The calculation formula for R² is as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
The RPIQ relates the spread of the observed values to the prediction error, so higher values indicate better predictions. IQR represents the interquartile range of the observed values (the difference between the third and first quartiles), while RMSE is the root mean square error between the predicted and observed values. The formula for calculating the RPIQ is as follows:
$$RPIQ = \frac{IQR}{RMSE} = \frac{Q_3 - Q_1}{RMSE}$$
All deep learning models were trained and tested on a single machine. They were implemented using PyTorch (version 1.11.0), and the training process was accelerated with an NVIDIA TITAN V 12 GB GPU.
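The three evaluation metrics can be computed directly from their definitions. The sketch below uses NumPy; the function names are hypothetical, and the quartiles are taken with NumPy's default linear interpolation, which is an assumption since the text does not state a quartile convention.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between observations and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rpiq(y_true, y_pred):
    """Interquartile range of the observations divided by the RMSE."""
    q1, q3 = np.percentile(np.asarray(y_true, float), [25, 75])
    return float((q3 - q1) / rmse(y_true, y_pred))
```

For example, with observations `[1, 2, 3, 4, 5]` and predictions `[1.1, 1.9, 3.2, 3.8, 5.0]`, the RMSE is about 0.141, R² is 0.99, and the RPIQ is about 14.14.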

Results and Discussion
Before the experiment, we randomly split the spectral dataset into two subsets, with 70% of the data used for training and 30% for independent testing. The descriptive statistics for the seven soil properties of the calibration and test set samples are summarized in Table 3. The soil properties show a wide range of values, and the means and standard deviations of the soil properties in the calibration and test sets are similar, which indicates that the two sets are similarly distributed and that the dataset was divided reasonably. To improve the model's generalization performance, we applied five-fold cross-validation to the training set. Specifically, the training dataset was randomly divided into five equal-sized subsets. Then, we performed five iterations of training and validation. In each iteration, one subset was used as the validation set, while the remaining four subsets were used as the training set. Each iteration yielded a model, which we evaluated on the independent test set. The final evaluation result of the model was obtained by averaging the performance metrics of the five models generated from the five iterations.
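The splitting scheme described above (a random 70/30 split followed by five-fold cross-validation on the training portion) can be sketched with index arrays. The function name `split_indices`, the seed, and the rounding of the test-set size are assumptions made for illustration; the text does not state how the random split was seeded or rounded.

```python
import numpy as np


def split_indices(n_samples, test_fraction=0.3, n_folds=5, seed=0):
    """Return test indices and a list of (fit, validation) index pairs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Divide the training indices into n_folds nearly equal subsets.
    folds = np.array_split(train_idx, n_folds)
    cv_pairs = []
    for i in range(n_folds):
        validation = folds[i]
        fit = np.concatenate([f for j, f in enumerate(folds) if j != i])
        cv_pairs.append((fit, validation))
    return test_idx, cv_pairs
```

Each of the five `(fit, validation)` pairs would train one model; all five models are then scored on the same held-out `test_idx`, and their metrics are averaged as the text describes.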

Analysis of 2D Multi-Channel Inputs
We first verified the effectiveness of the GADF method. As shown in Figure 6, converting the original spectra into single-channel GADF images yielded better test performance than the 1D spectral sequences for every soil property. This indicates that preserving spatial positional correlations through the GADF method enhances the information extraction capability of convolutional neural networks.
Table 4 shows the prediction accuracy for the soil properties using single-channel inputs built from the raw spectra and from spectra processed with the proposed preprocessing methods. Across properties, the improvement obtained from the different preprocessing combinations is limited, and some combinations even reduce performance. For the five soil properties CaCO3, N, CEC, pH, and Clay, the preprocessing methods that yielded the best prediction performance for single-channel 2D inputs were SG0 + SNV, SG1 + SNC, SG2 + SNV, SG0 + MSC, and SNV + Detrend, respectively. Compared to the results without any preprocessing, R² increased by 0.5−1.1%, while RMSE decreased by 1.3−5.9%. For OC and Sand, however, applying these preprocessing methods reduced model performance. This suggests that single-channel 2D inputs created with these preprocessing techniques do not effectively enhance the relative positional information, leading to limited improvements in prediction accuracy.
Figure 7 shows box plots of the prediction accuracy for the different soil properties when the raw spectra and the spectra derived from the various preprocessing methods are combined into multi-channel 2D inputs. The outcomes are largely consistent across soil properties. Compared with single-channel 2D inputs, multi-channel 2D inputs show a marked improvement in the average coefficient of determination and a significant reduction in RMSE. For OC, for instance, the RMSE of the multi-channel 2D input decreased by 3.06−6.51%, and R² increased by 0.4−1.0%. However, prediction accuracy does not always increase with the number of channels. By comparing the average R² of the different multi-channel inputs, the optimal number of channels for each property can be determined, and the combination of preprocessing methods that yields the highest R² for that number of channels can then be selected. For OC, the optimal number of channels is three, with the highest prediction accuracy achieved using a three-channel 2D input constructed with the SNV, SG1 + MSC, and SG2 + MSC methods. The optimal number of channels is seven for CaCO3, N, and CEC, eight for pH, five for Clay, and six for Sand. Table 5 presents the optimal number of channels for each property, the highest accuracy obtained with that number of channels, and the preprocessing methods used. These findings suggest that multi-channel 2D images constructed with diverse preprocessing methods enrich the input information, facilitate data augmentation, and improve the predictive performance for soil properties.
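The construction of a multi-channel 2D input can be illustrated with a minimal sketch. It assumes the standard GADF definition (min-max rescaling to [-1, 1], an arccos polar encoding, and entries sin(φ_i − φ_j)) together with PAA for dimensionality reduction. The function names, the 32 × 32 image size, the synthetic spectrum, and the first-difference channel (a crude stand-in for a derivative-type preprocessing step) are all illustrative assumptions, not the settings used in this study.

```python
import math

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each of n_segments contiguous chunks."""
    n = len(series)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    bounds = [k * n // n_segments for k in range(n_segments + 1)]
    return [sum(series[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]

def gadf(series):
    """Gramian Angular Difference Field: G[i][j] = sin(phi_i - phi_j)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    # Min-max rescale to [-1, 1] so that arccos is defined (assumed scaling).
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in series]
    # Clamp to guard against tiny floating-point overshoot before arccos.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.sin(a - b) for b in phi] for a in phi]

def first_difference(series):
    """Hypothetical stand-in for a derivative-type preprocessing step."""
    return [b - a for a, b in zip(series, series[1:])]

def multichannel_gadf(variants, size):
    """One GADF image per preprocessed variant: channels x size x size."""
    return [gadf(paa(v, size)) for v in variants]

# Synthetic stand-in for one reflectance spectrum (not LUCAS data).
spectrum = [math.exp(-((i - 120) / 60.0) ** 2) + 0.002 * i for i in range(400)]
image = multichannel_gadf([spectrum, first_difference(spectrum)], size=32)
print(len(image), len(image[0]), len(image[0][0]))  # 2 32 32
```

Because each channel is rescaled per spectrum before the angular encoding, a purely affine preprocessing step applied to a single spectrum would produce the same image as the raw channel in this sketch. The second channel here therefore uses a derivative-type transform.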

Training and Evaluating the CNNSANet Model
Based on the results of the multi-channel input analysis, we selected the 2D spectral images with the optimal number of channels for each property as inputs (Table 5). We then used the proposed CNNSANet model to predict the seven soil properties. In our experiment, the loss function was the root mean square error, and we used stochastic gradient descent (SGD) with a batch size of 64. Figure 8 shows the loss variation over 100 training iterations. For OC, CaCO3, N, pH, and Clay, the training and validation losses decreased rapidly during the first 10 epochs and then stabilized, with the two curves almost overlapping. For CEC and Sand, the training and validation losses decreased slowly, and the validation loss fluctuated considerably. This indicates that the prediction performance for these two properties is not as strong as for the other five. Overall, the loss of each model decreases as training proceeds, indicating that the models perform well in predicting soil properties and generalize well.
To evaluate the effectiveness of the MSSM block and MCIF block in the CNNSANet model, we conducted ablation experiments on the proposed spatial attention mechanism module. As inputs, we used single-channel 2D images constructed from raw spectra and multi-channel 2D images constructed using the optimal preprocessing methods for each soil property. We first replaced the MSSM block with a 1 × 1 convolutional block, then used the MSSM block alone, and finally employed the MSSM block together with the MCIF block. As shown in Table 6, the MSSM and MCIF blocks significantly improved performance. The MSSM block increased R² by 0.4−0.9% and reduced RMSE by 1.2−7.8% when predicting the seven soil properties. The MCIF block increased R² by 0.7−2.6% and decreased RMSE by 3.4−11.0%. These results indicate that the MSSM and MCIF blocks improve the predictive performance of the CNN regardless of whether single-channel or multi-channel 2D images are used as input, which confirms their effectiveness. Our findings suggest that the proposed spatial attention mechanism enhances the feature extraction abilities of CNNs, leading to improved soil property prediction.
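The training loss and the ablation comparison both rest on RMSE and R². The sketch below shows the two metrics under their standard definitions, which the text does not spell out and which are assumed here. The function names and the toy values are illustrative only and are not results from this study.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_m = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Toy numbers for illustration only; they are not results from the paper.
measured = [10.0, 20.0, 30.0, 40.0]
without_block = [13.0, 18.0, 33.0, 38.0]  # hypothetical predictions, block removed
with_block = [11.0, 19.0, 31.0, 39.0]     # hypothetical predictions, block enabled

for name, predicted in [("without", without_block), ("with", with_block)]:
    print(name, round(rmse(measured, predicted), 3), round(r_squared(measured, predicted), 3))
```

A lower RMSE and a higher R² for the second variant would correspond to the kind of improvement reported for the MSSM and MCIF blocks in Table 6.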
Figure 9 presents scatter plots of the measured versus predicted values for the seven soil properties using the CNNSANet model. Among the predicted soil properties, CaCO3 and OC show the highest prediction accuracy (R² > 0.95). The best models for N and pH achieve R² values of 0.935 and 0.93, respectively. The predictive performance for CEC and Clay is comparatively weaker, with R² values of 0.803 and 0.86, respectively, while Sand shows the lowest R² of 0.789.

Comparisons of Different Methods
To demonstrate the performance of our model, we fed the same optimal multi-channel 2D inputs for each soil property to other image processing models and conducted comparative analyses. We selected several representative models: ResNet50, a deep convolutional network; Vision Transformer (ViT) [36], which applies the Transformer architecture from natural language processing to images; and ConvNeXt, a next-generation convolutional neural network. These models were trained to predict soil properties under consistent network hyperparameters. The resulting prediction performance (RMSE and R²) is presented in Figure 10. The results indicate that our model outperforms the other models and can be used effectively for soil property prediction.
To further evaluate the predictive performance of the proposed modeling method, we compared the CNNSANet model with the two-dimensional convolutional neural network (2D-CNN) employed by Padarian et al. [18], the one-dimensional long short-term memory neural network (1D-LSTM) used by Singh and Kasana [16], the two-dimensional Swin Transformer network (2D-Swin Transformer) utilized by Jin et al. [20], and the one-dimensional machine learning model (1D-PCR-Poly) proposed by Tavakoli et al. [37]. As shown in Table 7, the CNNSANet model significantly improves the prediction performance for most soil properties. Compared to the 2D-Swin Transformer, which also uses a 2D transformation, our model reduces the RMSE for OC, N, CEC, pH, Clay, and Sand by 17.9%, 23.1%, 23.7%, 32.2%, 21.1%, and 21.3%, respectively. We attribute this improvement to the multi-channel 2D images we constructed, which enrich the input information. In addition, the proposed convolutional neural network with multi-scale spatial attention offers stronger feature extraction capabilities, leading to better feature fitting and higher prediction accuracy. It should be noted that some studies used both organic and mineral soils from the dataset [18,20,37], while others focused only on mineral soils [17,26]. Our approach treats organic and mineral soils as a single entity to enhance the model's generalization performance.
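The percentage reductions quoted above can be read as relative changes with respect to the reference model. The text does not state the formula, so the expression below is an assumption about how such figures are typically computed, with RMSE_ref denoting the RMSE of the compared method and RMSE_ours that of CNNSANet:

```latex
\[
\Delta_{\mathrm{RMSE}} \;=\;
\frac{\mathrm{RMSE}_{\mathrm{ref}} - \mathrm{RMSE}_{\mathrm{ours}}}{\mathrm{RMSE}_{\mathrm{ref}}}
\times 100\%
\]
```

Under this reading, a positive value means that CNNSANet attains a lower RMSE than the reference method on the same soil property.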

Conclusions
This study proposes a CNN structure based on 2D multi-channel inputs and a multi-scale spatial attention mechanism. First, we find that combining multi-channel inputs with 2D spectral inputs effectively improves the prediction accuracy for various soil properties, and we investigate how the number of channels in the 2D inputs affects the prediction results for each of the seven properties. Second, the proposed convolutional neural network with a spatial attention mechanism, CNNSANet, better captures the spatial positional correlation information of 2D spectral images, which enhances the feature extraction capability of the network and thereby improves the prediction of soil properties. On the large-scale LUCAS dataset, the CNNSANet model improves prediction accuracy and outperforms current methods. Unlike laboratory data, VNIR spectra collected in the field are influenced by multiple environmental factors such as weather, light intensity, and humidity. These factors can introduce higher data variability and thus complicate soil property prediction. Building on the favorable results obtained in this study, we will evaluate our model on more challenging field-collected soil VNIR spectra in future research.

Figure 2. The procedure for converting a visible-near-infrared spectral sequence into a GADF image is as follows: (a1) is the original spectral sequence, (a2) is the spectral sequence after PAA dimensionality reduction, (a3) is the polar coordinate transformation, and (a4) is the resulting GADF image.

Figure 3. The overall framework of the CNNSANet.

Figure 6. RMSE and R² comparison between 1D raw spectral data and 2D single-channel GADF images constructed using the same 1D raw spectral data as inputs.

Figure 7. Boxplot of prediction accuracies for different properties of 2D inputs constructed from spectral information obtained using various preprocessing methods and raw spectral information.

Figure 8. Training and validation losses of the CNNSANet model for seven soil properties.

Figure 9. Scatter plot of CNNSANet model for measured and predicted values of seven soil properties.

Sensors 2024 , 18 Figure 10 .
Figure 10. Results of the CNNSANet and other deep learning models for soil property prediction.

Table 2. The number of permutations and combinations of different preprocessing methods after two-dimensional transformation.
Note: CN indicates the number of channels considered; PCN indicates the number of outcomes from permutation and combination; NCC indicates the number of combined channels.

Table 3. Information statistics of seven soil properties for training and testing sets.

Table 4. Test set results of seven soil properties (OC, N, CEC, pH, CaCO3) for single-channel 2D input constructed using different preprocessing methods based on the VGG16 network model.

Table 5. The highest accuracy and multi-channel combination method for different multi-channel numbers based on different properties.

Table 6. The results of the ablation experiments on the MSSM block and MCIF block, using single-channel 2D images constructed from raw spectra and multi-channel 2D images constructed with the optimal preprocessing method for each soil property.
Note: SC indicates the input of single-channel 2D images based on raw spectra, whereas MC indicates the input of multi-channel 2D images constructed with the optimal preprocessing methods for each attribute.

Table 7. The comparison between the proposed CNNSANet model in this paper and other methods from previous studies.
Note: NA, not available.